#import data
ex1 <- read_excel("Chapter4_exercises_data.xls", sheet = "Exercise 6", range = "A1:C153")
colnames(ex1) <- c('date', 'actual_growth', 'greenbook_growth_forecast')
Time Series Plots
#plot of realized values
actual_growth_ts <- ts(ex1$actual_growth, start = 1969, freq = 4)
plot(actual_growth_ts, main="Actual Growth", xlab = "Year", ylab = "Rate")
#plot of forecasts
greenbook_growth_forecast_ts <- ts(ex1$greenbook_growth_forecast, start = 1969, freq = 4)
plot(greenbook_growth_forecast_ts, main="Greenbook Growth Forecast", xlab = "Year", ylab = "Rate")
#plot of forecast errors
forecast_errors <- actual_growth_ts - greenbook_growth_forecast_ts
plot(forecast_errors, main="Forecast Errors", xlab = "Year", ylab = "Rate")
All three plots are first order stationary, with mean reversion and changes in variance. There is lots of volatility up until 1985 due to the large spikes, but it becomes more stable later on.
Descriptive Statistics
summary(actual_growth_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0497 0.3475 0.7679 0.7539 1.1741 3.9343
summary(greenbook_growth_forecast_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.1963 0.4901 0.6683 0.6765 0.9914 2.0604
summary(forecast_errors)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.48030 -0.37236 0.05326 0.07746 0.48582 2.85204
Actual growth shows the widest range, indicating more variability compared to the forecasted growth. The mean of actual growth is 0.75 while the mean of the forecasted growth is 0.68, suggesting that actual values tend to be slightly higher on average. Both actual growth and forecasted growth have means and medians that are nearly the same. The forecast errors show a slight positive skew, meaning that forecasts tend to underestimate actual growth.
ACF/PACF
par(mfrow=c(1,2))
#acf/pacf of realized values
acf(actual_growth_ts)
pacf(actual_growth_ts)
#acf/pacf of forecasts
acf(greenbook_growth_forecast_ts)
pacf(greenbook_growth_forecast_ts)
#acf/pacf of forecast errors
acf(forecast_errors)
pacf(forecast_errors)
par(mfrow=c(1,1))
There is time dependency observed for both actual growth and forecasted growth, as observed by the significant spikes in the ACF and PACF plots. The autocorrelation in the forecasted growth is more evident, which makes sense since forecasts are limited in their ability to predict sudden changes. The forecast errors doesn’t show autocorrelation, suggesting that the errors are random and don’t depend on past values.
#expected value of the forecast error
mean(forecast_errors)
## [1] 0.07746318
The expected value of the forecast errors is 0.077, which is basically zero.
#run regression of forecast errors on several lags
reg <- dynlm(forecast_errors ~ forecast_errors + L(forecast_errors, 1) + L(forecast_errors, 2) + L(forecast_errors, 3))
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
## the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
## model.matrix: no columns are assigned
summary(reg)
##
## Time series regression with "ts" data:
## Start = 1969(4), End = 2006(4)
##
## Call:
## dynlm(formula = forecast_errors ~ forecast_errors + L(forecast_errors,
## 1) + L(forecast_errors, 2) + L(forecast_errors, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.51624 -0.43181 -0.02828 0.35103 2.91904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.060552 0.062702 0.966 0.336
## L(forecast_errors, 1) 0.083250 0.083178 1.001 0.319
## L(forecast_errors, 2) 0.054550 0.082762 0.659 0.511
## L(forecast_errors, 3) 0.005634 0.082664 0.068 0.946
##
## Residual standard error: 0.7539 on 145 degrees of freedom
## Multiple R-squared: 0.0108, Adjusted R-squared: -0.009666
## F-statistic: 0.5277 on 3 and 145 DF, p-value: 0.6639
The p-value 0.66 > 0.05, so we fail to reject the null and conclude that the model is not statistically significant. Similarly, the p-values for the lagged terms are all > 0.05, which means that none of them significantly help predict the current forecast error. The adjusted r-squared is -0.009666, suggesting that the model doesn’t explain much of the variation in the forecast errors.
#perform an f-test
f_test <- linearHypothesis(reg, c("L(forecast_errors, 1) = 0", "L(forecast_errors, 2) = 0", "L(forecast_errors, 3) = 0"), test = "F")
f_test
## Linear hypothesis test
##
## Hypothesis:
## L(forecast_errors,0
## L(forecast_errors, 2) = 0
## L(forecast_errors, 3) = 0
##
## Model 1: restricted model
## Model 2: forecast_errors ~ forecast_errors + L(forecast_errors, 1) + L(forecast_errors,
## 2) + L(forecast_errors, 3)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 83.319
## 2 145 82.419 3 0.89986 0.5277 0.6639
Looking at the f-test results, we see that the overall regression has a p-value of 0.66, which is not statistically significant. This suggests that the forecast error is not predictable from its own past. This is similar to what we saw with the autocorrelation functions in Exercise 6, where the forecast errors didn’t show clear patterns and were likely random. This makes sense since one-quarter-ahead forecast errors represent the unexpected fluctuations.
a) #theoretical autocorrelation for lag 1 \[ y_t = 0.7 - 2\epsilon_{t-1} + 1.35\epsilon_{t-2} + \epsilon_t \\ \rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{\sigma^2(\theta_1 + \theta_1\theta_2^2)} {\sigma^2(1 + \theta_1^2 + \theta_2^2)} \\ = \frac{1(-2 + (-2)(1.35))} {1 (1+ (-2)^2)+(1.35)^2} \\ = \frac{4.77}{6.82} \\ = -0.689 \]
#theoretical autocorrelation for lag 2 \[ \rho_2 = \frac{\gamma_2}{\gamma_0} \\ = \frac{\theta^2}{\sigma_2(1 + \theta_1^2 + \theta_2^2)} \\ = \frac{1.35}{6.82} \\ = 0.197 \] #theoretical autocorrelation for lags 3-10 \[ \rho_3 = ... = \rho_{10} = 0 \\ \]
b)
# ma.sim <- arima.sim(model = list(ma=c(-2, 1.35)), n = 100)
# ma.sim
# acf(ma.sim)
# pacf(ma.sim)
y = e = rnorm(100)
for (t in 3:100) y[t] = 0.7 - 2*e[t-1] + 1.35*e[t-2] + e[t]
# Plots and create objects
acf <- acf(y, lag.max=10, main="Random MA(2) - ACF")
pacf <- pacf(y, lag.max=10, main="Random MA(2) - PACF")
The sample ACF plot shows 2 significant spikes, and the sample PACF plot
shows exponential decay which is expected for an MA(2) process. This is
consistent with the theoretical autocorrelations calculated in a). The
ACF values on the plot for lags 1 and 2 seem pretty close to the
theoretical values calculated. For lags 3-10, the autocorrelations tend
to be approaching 0 as well.
# ma.sim<-arima.sim(model=list(ma=c(-2,1.35)),n=100)
# Calculate the unconditional mean
unconditional_mean <- mean(y)
# Plot ma.sim
plot(y, type = "l", col = "blue", xlab = "Time", ylab = "Value", main = "MA(2) Simulated Data with Unconditional Mean")
# Add the unconditional mean as a horizontal dashed line
abline(h = unconditional_mean, lty = 2, col = "red")
ma.model <- arma(y, order = c(0, 2))
summary(ma.model)
##
## Call:
## arma(x = y, order = c(0, 2))
##
## Model:
## ARMA(0,2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3768 -0.9044 0.2063 0.7622 2.7168
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ma1 -1.47167 0.07717 -19.071 <2e-16 ***
## ma2 0.69926 0.07890 8.863 <2e-16 ***
## intercept 0.73181 0.02930 24.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 1.674, Conditional Sum-of-Squares = 162.51, AIC = 341.29
The estimated model is y_t = 0.67 - 1.63{t-1} + 0.86{t-2} + _t. The estimated parameters are very different from the theoretical values. We should expect some differences with the theoretical values because the sample size is small, but the differences that we observe are too large.
From the invertible process y_t = 0.67 - 1.63{t-1} + 0.86{t-2} + _t, we compute the following for h=1,2,3:
# Given values
epsilon_t <- 0.4
epsilon_t1 <- -1.2
# Compute f_t1
f_t1 <- 0.67 - 1.63 * epsilon_t + 0.86 * epsilon_t1
# Compute f_t2
f_t2 <- 0.67 + 0.86 * epsilon_t
# Compute f_t3
f_t3 <- 0.67
# Print the results
cat("f_t1 =", f_t1, "\n")
## f_t1 = -1.014
cat("f_t2 =", f_t2, "\n")
## f_t2 = 1.014
cat("f_t3 =", f_t3, "\n")
## f_t3 = 0.67
\[ \text{The process } y_t = 1.2 + 0.8\epsilon_{t-1} + \epsilon_t \text{ can be written as } y_t - 1.2 = (1 + 0.8L)\epsilon_t, \text{ which implies that } \frac{y_{t-1.2}}{1 - (-0.8L)} = \epsilon_t. \]
\[ \text{Since } 0.8 < 1, \text{ the ratio } \frac{1}{1 - (-0.8L)} \text{ is the limit of the following sequence: } \] \[ \frac{1}{1 - (-0.8L)} = 1 - 0.8L + 0.82L^2 - 0.83L^3 + \dotsb \] \[ \text{Thus, } (y_t - 1.2)\left(1 - 0.8L + 0.8^2L^2 - 0.8^3L^3 + \dotsb\right) = \epsilon_t. \]
\[ \text{Then, the autoregressive representation is } (y_t - 1.2) = 0.8(y_{t-1} - 1.2) - 0.8^2(y_{t-2} - 1.2) + 0.8^3(y_{t-3} - 1.2) - \dotsb + \epsilon_t. \]
\[ \text{The process } y_t = 1.2 + 1.25\epsilon_{t-1} + \epsilon_t \text{ is non-invertible, i.e., } \theta = 1.25 > 1. \text{ There is no limit to the ratio } \frac{1}{1 - (-1.25L)}, \text{ but we can transform this ratio by using the forward operator } F = \frac{1}{L}: \] \[ \frac{1}{1 - (-1.25L)} = \frac{1}{1 - (-1/0.8F)} = \frac{0.8F}{1-(-0.8F)} = 0.8F(1 - 0.8F + 0.8^2F^2 - 0.8^3F^3 + \dotsb) \] \[ \text{Then, the autoregressive representation is } (y_t - 1.2) \times 0.8F(1 - 0.8F + 0.8^2F^2 - 0.8^3F^3 + \dotsb) = \epsilon_t, \] \[ \text{which makes } y_t \text{ a function of the future values } y_{t+1}, y_{t+2}, \dotsc. \text{ Obviously, this representation is not useful for forecasting purposes. Thus, we choose the invertible process } y_t = 1.2 + 0.8\epsilon_{t-1} + \epsilon_t \text{ because the present is a function of the past.} \]
Download financial prices of your favorite stocks. Obtain the autocorrelation functions. Which process(es) will fit the financial returns to these stocks? Propose a model, estimate it, and forecast at several horizons.
nvda <- read.csv("NVDA2.csv")
nvda_ts <- ts(nvda$AdjClose, start = 2019, end = 2024, frequency = 251)
plot(nvda_ts)
acf(nvda_ts, lag.max=50, main = "ACF of Nvidia Stock Prices")
Given the stochastic trend in the data, the MA process is likely a
better fit to capture the volatility, particularly the noticeable sharp
rise in early 2020 followed by the subsequent sharp decline.
#AR(p) fits
ar1=arma(nvda_ts,order=c(1,0))
summary(ar1)
##
## Call:
## arma(x = nvda_ts, order = c(1, 0))
##
## Model:
## ARMA(1,0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.743761 -0.043911 -0.006683 0.038476 0.598190
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ar1 0.995058 0.002637 377.409 <2e-16 ***
## intercept 0.010159 0.005750 1.767 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.01264, Conditional Sum-of-Squares = 15.85, AIC = -1921.21
ar2=arma(nvda_ts,order=c(2,0))
summary(ar2)
##
## Call:
## arma(x = nvda_ts, order = c(2, 0))
##
## Model:
## ARMA(2,0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.744271 -0.044028 -0.006686 0.038675 0.597914
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ar1 0.9959510 0.0282230 35.289 <2e-16 ***
## ar2 -0.0008814 0.0282069 -0.031 0.9751
## intercept 0.0101096 0.0057574 1.756 0.0791 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.01265, Conditional Sum-of-Squares = 15.85, AIC = -1918.28
ar3=arma(nvda_ts,order=c(5,0))
summary(ar3)
##
## Call:
## arma(x = nvda_ts, order = c(5, 0))
##
## Model:
## ARMA(5,0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.732180 -0.042576 -0.006815 0.038079 0.595544
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ar1 0.996110 0.028212 35.308 <2e-16 ***
## ar2 -0.037354 0.039827 -0.938 0.3483
## ar3 0.035191 0.039828 0.884 0.3769
## ar4 0.029733 0.039824 0.747 0.4553
## ar5 -0.028616 0.028197 -1.015 0.3102
## intercept 0.010190 0.005773 1.765 0.0776 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.01265, Conditional Sum-of-Squares = 15.82, AIC = -1912.2
#AR(3) model would be the best fit because it has the lowest AIC value.
plot(nvda_ts,xlab='Year', ylab="Nvdia", lwd=2)
grid()
lines(ar1$fitted.values,col="blue",lwd=2,lty=2)
lines(ar2$fitted.values,col="seagreen2",lwd=2,lty=2)
lines(ar3$fitted.values,col="red",lwd=2,lty=2)
legend("topright",legend=c("Data","AR(3)","AR(2)","AR(1)"),text.col=1:4,bty="n")
#Forecast
ar3=Arima(nvda_ts,order=c(3,0,0))
plot(forecast(ar3,h=20),xlim=c(2019,2024)) #20-days ahead
plot(forecast(ar3,h=100),xlim=c(2019,2024)) #100-days ahead
plot(forecast(ar3,h=200),xlim=c(2019,2024)) #200-days ahead
#MA(q) Model Fitting
ma1=arma(nvda_ts,order=c(0,1))
summary(ma1)
##
## Call:
## arma(x = nvda_ts, order = c(0, 1))
##
## Model:
## ARMA(0,1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3966 -0.5145 -0.1285 0.4446 2.1682
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ma1 0.931970 0.007504 124.20 <2e-16 ***
## intercept 1.812244 0.035037 51.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.4156, Conditional Sum-of-Squares = 521.19, AIC = 2465.58
ma2=arma(nvda_ts,order=c(0,2))
summary(ma2)
##
## Call:
## arma(x = nvda_ts, order = c(0, 2))
##
## Model:
## ARMA(0,2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40883 -0.30698 -0.08334 0.25745 1.72426
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ma1 1.47808 0.02145 68.92 <2e-16 ***
## ma2 0.80387 0.01458 55.15 <2e-16 ***
## intercept 1.79223 0.03747 47.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.1667, Conditional Sum-of-Squares = 208.94, AIC = 1320
ma5=arma(nvda_ts,order=c(0,5))
summary(ma5)
##
## Call:
## arma(x = nvda_ts, order = c(0, 5))
##
## Model:
## ARMA(0,5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34391 -0.13534 -0.03802 0.12243 1.05387
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ma1 1.67197 0.03165 52.82 <2e-16 ***
## ma2 1.97205 0.05029 39.21 <2e-16 ***
## ma3 1.73602 0.04465 38.88 <2e-16 ***
## ma4 1.14216 0.03304 34.57 <2e-16 ***
## ma5 0.45993 0.02468 18.64 <2e-16 ***
## intercept 1.70701 0.04708 36.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.04609, Conditional Sum-of-Squares = 57.88, AIC = -288.51
All the coefficients in the MA models are statistically significant. MA(3) model would be the best fit because it has the lowest AIC value. However, the lower AIC values in the MA models compared to the AR model indicate that the MA models better capture the data’s underlying structure, leaving less unexplained random error.
#MA(q) fits
plot(nvda_ts,xlab='Year', ylab="NVIDIA Stock Price", lwd=2)
grid()
lines(ma1$fitted.values,col="blue",lwd=1)
lines(ma2$fitted.values,col="seagreen2",lwd=1)
lines(ma5$fitted.values,col="red",lwd=1)
legend("topright",legend=c("Data","MA(5)","MA(2)","MA(1)"),text.col=1:4,bty="n")
#Forecast
ma5=Arima(nvda_ts,order=c(0,0,5))
plot(forecast(ma5,h=50)) #50 days ahead
plot(forecast(ma5,h=200)) #200 days ahead
autoplot(plastics) + ylab("Sales (Thousands)") + ggtitle("Monthly Sales of Product A for Plastics Manufacturer")
ggseasonplot(plastics)
ggsubseriesplot(plastics)
There are yearly seasonal fluctuations in the time series of sales of product A, where sales peak around the summer (mid-year) and are lowest at the beginning and end of each year. There is an overall increasing trend over the five years in the time series.
#perform multiplicative decomposition
plastics_dcmp <- decompose(plastics, "multiplicative")
#plot multiplicative decomposition
autoplot(plastics_dcmp) + ggtitle("Sales of Product A for Plastics Manufacturer — Multiplicative Decomposition")
The results of the multiplicative decomposition support the graphical interpretation from part a, as there is an increasing trend over the years and evident seasonality with peaks in the summer within each year.
seasonal_plastics <- plastics_dcmp$seasonal
seasonally_adjusted_plastics <- plastics/seasonal_plastics
autoplot(plastics, series = "Data") + autolayer(seasonally_adjusted_plastics, series = "Seasonally Adjusted") + ggtitle("Sales of Product A for Plastics Manufacturer") + ylab("Sales (Thousands)")
#remove the trend and seasonality
trend_plastics <- plastics_dcmp$trend
detrend_plastics <- plastics/trend_plastics
seasonal_plastics <- plastics_dcmp$seasonal
seasonally_adjusted_plastics <- plastics/seasonal_plastics
#detrended and seasonally adjusted series
detrend_seas_adj_plastics <- plastics/plastics_dcmp$trend/plastics_dcmp$seasonal
autoplot(detrend_seas_adj_plastics, main = "Detrended and Seasonally Adjusted - Multiplicative")
#compute residuals
residuals_plastics <- plastics / (trend_plastics * seasonal_plastics)
# Plot ACF and PACF of residuals
tsdisplay(residuals_plastics, lag.max = 20, main = "ACF and PACF of Residuals - Multiplicative")
# Seasonally Adjust: Y / S
autoplot(seasonally_adjusted_plastics)
plastics2 <- plastics
plastics2[30] <- plastics2[30] + 500
plastics_dcmp2 <- decompose(plastics2, "multiplicative")
seasonal_plastics2 <- plastics_dcmp2$seasonal
seasonally_adjusted_plastics2 <- plastics2/seasonal_plastics2
autoplot(plastics2, series = "Data") + autolayer(seasonally_adjusted_plastics2, series = "Seasonally Adjusted") + ggtitle("Sales of Product A for Plastics Manufacturer (with outlier)") + ylab("Sales (Thousands)")
The outlier creates a large upward spike in the data.
#f
plastics3 <- plastics
plastics3[55] <- plastics3[55] + 500
plastics_dcmp3 <- decompose(plastics3, "multiplicative")
seasonal_plastics3 <- plastics_dcmp3$seasonal
seasonally_adjusted_plastics3 <- plastics3/seasonal_plastics3
autoplot(plastics3, series = "Data") + autolayer(seasonally_adjusted_plastics3, series = "Seasonally Adjusted") + ggtitle("Sales of Product A for Plastics Manufacturer (with outlier)") + ylab("Sales (Thousands)")
Having an outlier near the end rather than in the middle unsurprisingly results in a spike towards the end of the data rather than in the middle. This seems to affect the overall trend more and the seasonality less than if the outlier were in the middle.
# STL decomposition with fixed seasonality
stl_brick_fixed_st <- stl(bricksq,
s.window = "periodic",
robust = TRUE)
# STL decomposition with changing seasonality
stl_brick_changing_st <- stl(bricksq,
s.window = 5,
robust = TRUE)
# plot decomposed data
autoplot(stl_brick_fixed_st) +
ggtitle("Brick production data decomposed by STL with fixed seasonality")
autoplot(stl_brick_changing_st) +
ggtitle("Brick production data decomposed by STL with changing seasonality")
According to the plot above, the fixed seasonality has seasonal
component that is additive, and the changing seasonality has seasonal
component that is multiplicative. Moreover, fixed seasonality has
slightly noisier remainder than changing seasonality.
# plot data which are decomposed by STL with fixed seasonality
autoplot(bricksq, series = "Data") +
autolayer(trendcycle(stl_brick_fixed_st),
series = "Trend-cycle") +
autolayer(seasadj(stl_brick_fixed_st),
series = "Seasonally Adjusted Data") +
ggtitle("Quarterly clay brick production in Australia",
subtitle = "-decomposed by STL with fixed seasonality") +
scale_color_manual(values = c("gray", "red", "blue"),
breaks = c("Data", "Trend-cycle", "Seasonally Adjusted Data"))
# plot data which are decomposed by STL with changing seasonality
autoplot(bricksq, series = "Data") +
autolayer(trendcycle(stl_brick_fixed_st),
series = "Trend-cycle") +
autolayer(seasadj(stl_brick_fixed_st),
series = "Seasonally Adjusted Data") +
ggtitle("Quarterly clay brick production in Australia",
subtitle = "-decomposed by STL with changing seasonality") +
scale_color_manual(values = c("gray", "red", "blue"),
breaks = c("Data", "Trend-cycle", "Seasonally Adjusted Data"))
stl_brick_fixed_st %>% seasadj() %>% naive() %>% autoplot() +
ggtitle(label = "Naive forecast of seasonally adjusted brick data",
subtitle = "after STL decomposition with fixed seasonality")
stl_brick_changing_st %>% seasadj() %>% naive() %>% autoplot() +
ggtitle(label = "Naive forecast of seasonally adjusted brick data",
subtitle = "after STL decomposition with changing seasonality")
From the two forecasts, we can see that the prediction intervals of seasonally adjusted data decomposed by STL with changing seasonality have slightly smaller range than the one with fixed seasonality. It happened because the variance of the remainder component decreased when the seasonality can be changed.
stlf_brick <- stlf(bricksq) #reseasonalize
autoplot(stlf_brick) #giving forecasts for original data
#
e. Do the residuals look uncorrelated?
checkresiduals(stlf_brick)
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 44.704, df = 8, p-value = 4.186e-07
##
## Model df: 0. Total lags used: 8
The residuals doesn’t look uncorrelated. They are correlated with each other shown by the small p-value.
stlf_brick_robust <- stlf(bricksq, robust = TRUE)
autoplot(stlf_brick_robust)
checkresiduals(stlf_brick_robust)
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 27.278, df = 8, p-value = 0.000633
##
## Model df: 0. Total lags used: 8
With a robust stl decomposition, it looks like the autocorrelations became lower generally, but there are still some high values left.
trainset_brick <- subset(bricksq,
end = length(bricksq) - 8)
testset_brick <- subset(bricksq,
start = length(bricksq) - 7)
snaive_brick <- snaive(trainset_brick)
stlf_brick_part <- stlf(trainset_brick, robust = TRUE)
# plot data and forecast results
autoplot(bricksq, series = "Original data") +
geom_line(size = 1) +
autolayer(stlf_brick_part, PI = FALSE, size = 1,
series = "stlf") +
autolayer(snaive_brick, PI = FALSE, size = 1,
series = "snaive") +
scale_color_manual(values = c("gray50", "blue", "red"),
breaks = c("Original data", "stlf", "snaive")) +
scale_x_continuous(limits = c(1990, 1994.5)) +
scale_y_continuous(limits = c(300, 600)) +
guides(colour = guide_legend(title = "Data")) +
ggtitle("Forecast from stlf and snaive functions") +
annotate(
"rect",
xmin=1992.75,xmax=1994.5,ymin=-Inf,ymax=Inf,
fill="lightgreen",alpha = 0.3
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
## Warning: Removed 136 rows containing missing values or values outside the scale range
## (`geom_line()`).
We
can see from the plot that the forecasts from stlf function are more
similar to the original data than the forecasts from snaive function.
Unlike snaive function, stlf function can also use trend, and its
seasonality can change over time. The test set have trend with
seasonality. Therefore stlf function was better than snaive function to
predict brick production amount of near future.
autoplot(writing)
The data exhibits both an upward trend and yearly seasonality. Thus, the “rwdrift” method is more appropriate than the “naive” method.
fit <- stlf(writing, s.window = "periodic", robust = TRUE, method = "rwdrift")
autoplot(fit, main = "Forecast without Box-Cox Transformation") + xlab("Month") + ylab("Sales")
fit2 <- stlf(writing, s.window = "periodic", robust = TRUE,lambda = BoxCox.lambda(writing))
autoplot(fit2, main = "Forecast with Box-Cox Transformation") + xlab("Month") + ylab("Sales")
Based on the two forecasts, the one with a Box-Cox transformation seems to perform better and has smaller prediction intervals. The transformation improves the forecast because the seasonality varies over time in the original data.
str(fancy)
## Time-Series [1:84] from 1987 to 1994: 1665 2398 2841 3547 3753 ...
head(fancy)
## Jan Feb Mar Apr May Jun
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74
autoplot(fancy)
There
is increasing trend so it would be better to use rwdrift method for
forecasting non-seasonal component. And it would be better to do Box-Cox
transformation to make the variation in the data about the same across
the whole series.
stlf_fancy <- stlf(fancy,
s.window = 13,
robust = TRUE,
lambda = BoxCox.lambda(fancy),
method = "rwdrift")
autoplot(stlf_fancy)
The prediction intervals increase dramatically because of Box-Cox
transformation. But without the transformation, the forecasts are
unreasonable.